While in the early 1990s High Energy Physics (HEP) led the computing industry by establishing the HTTP
protocol and the first web servers, the long time-scale for planning and building modern HEP experiments has
resulted in a generally slow adoption of emerging computing technologies which rapidly become commonplace
in business and other scientific fields. I will review some of the fundamental problems in HEP
computing and then present the current state and future potential of employing new computing technologies in
addressing these problems.



1. HEP Computing Problems 

The high energy frontier will soon be explored by
four detector experiments recording the results of
collisions at the Large Hadron Collider (LHC), located at
the European Organization for Nuclear Research (CERN)
in Geneva, Switzerland. Designed to record, identify,
and study the Higgs boson and a wide array of po- 
tential new physics signatures as well as B mesons 
and quark-gluon plasma, these experiments ultimately 
hope to observe an inconsistency between nature and 
the Standard Model (SM) of High Energy Physics. 
However, since the typical production and detection
probabilities for interesting processes are eleven orders
of magnitude smaller than those of uninteresting
SM backgrounds, these experiments must sift through
forty million events per second while the LHC is col- 
liding beams, recording roughly one in two hundred 
thousand events in order to accommodate bandwidth 
constraints. The resultant 1 to 2 petabytes (PB) of 
monthly data must then be processed and analyzed 
offline into a form which is suitable for extraction of 
measurements. 
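
The rates quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the collision rate and selection fraction come from the text, while the live fraction (the share of wall-clock time with colliding beams) is an assumed parameter that the text does not give, and the function names are hypothetical.

```python
def recorded_rate_hz(collision_rate_hz, keep_one_in):
    """Event rate written to storage after the online selection."""
    return collision_rate_hz / keep_one_in


def implied_event_size_mb(monthly_volume_pb, rate_hz, live_fraction, days=30):
    """Average event size implied by a monthly data volume.

    live_fraction is an assumption (fraction of time with colliding
    beams); it is not taken from the text.
    """
    events = rate_hz * live_fraction * days * 86400
    return monthly_volume_pb * 1e9 / events  # 1 PB = 1e9 MB


# Forty million events per second, one in two hundred thousand kept.
rate = recorded_rate_hz(40e6, 200_000)
print(rate)  # 200.0 events per second

# With continuous running (live_fraction=1.0, an idealization),
# 1 PB per month corresponds to roughly 2 MB per recorded event.
print(implied_event_size_mb(1.0, rate, live_fraction=1.0))
```

A lower live fraction gives a proportionally larger implied event size, so the second number should be read as a lower bound under the stated assumption.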

The unprecedented scale of the required computing 
resources and the complexity of the computing chal- 
lenges have made computing an important element 
of HEP. Each LHC experiment has deployed its own
"Computing Model" (CM), which consists of several
classes ("tiers") of facilities which together comprise a
grid of resources bound together by fast links and special
middleware. In the ATLAS experiment, the Tier-0 facility
at CERN will perform the first-pass processing of
the data. The ten national Tier-1 facilities will re- 
process these data with better calibrations within two 
months after data collection. Meanwhile, the roughly 
thirty Tier-2 facilities placed at specific universities 
and labs will focus on simulations and data analysis. 
In addition, considerable effort is directed towards de- 
velopment of: 

• Monte Carlo tools (which link theory and experiment),

• detector simulation frameworks,

• algorithmic and statistical analysis tools,

• data processing frameworks and algorithms,

• grid middleware,

• data production and management systems, and

• underlying data persistency and database infrastructure.

Ultimately the performance of these software compo- 
nents, many of which are used by multiple experi- 
ments, determines the scale of computing resources 
required by HEP experiments. 

Figure 1: Percentage of ATLAS Tier 2 CPU required
for simulation production as a function of the fraction of 2010
recorded data which is fully and fast simulated.

Despite the impressive scale of the LHC computing
grid and the sophistication of the underlying software
technologies, a dearth of computing resources will be
one of the primary bottlenecks in extracting measurements
from LHC data. For example, Figure 1 shows
the percentage of ATLAS Tier 2 resources required in
2010 for fast and full simulation production as a function
of the fraction of recorded data that is simulated. The ATLAS
CM expects that physics analysis activity will
require roughly half of the resources at ATLAS Tier
2s, leaving the other half for simulation. But we see
that allocating the nominal 50% of Tier 2 resources
to simulation limits the volume of fully simulated
data to roughly 20% of the data the ATLAS
detector will collect. This deficiency will fundamentally
limit the significance of the comparisons between
recorded data and theoretical predictions (via detailed
detector modeling) which are necessary to make any
statements about nature. Therefore ATLAS must rely
on less accurate fast simulations to produce Monte
Carlo statistics comparable to the recorded data. And
if data need to be resimulated, as is often the case in
the first years of an experiment, CPU resources must
be borrowed from analysis activity, thereby stalling
the extraction of measurements.
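
The trade-off between full and fast simulation can be put in back-of-the-envelope form. The sketch below assumes, purely for illustration, that the Tier 2 CPU share needed for simulation grows linearly with the simulated fraction of the recorded data. The only number taken from the text is that 50% of Tier 2 CPU buys roughly 20% full simulation; the fast-simulation speed-up factor and all function names are hypothetical.

```python
# From the text: 50% of Tier 2 CPU fully simulates about 20% of the data.
FULL_SIM_COST = 0.50 / 0.20  # Tier 2 CPU share per unit fraction of data


def tier2_share(full_fraction, fast_fraction, fast_speedup):
    """Tier 2 CPU share needed under an assumed linear cost model.

    fast_speedup is a hypothetical factor by which fast simulation is
    cheaper per event than full simulation.
    """
    fast_cost = FULL_SIM_COST / fast_speedup
    return full_fraction * FULL_SIM_COST + fast_fraction * fast_cost


def max_fast_fraction(budget, full_fraction, fast_speedup):
    """Largest fast-simulated fraction that still fits in the CPU budget."""
    remaining = budget - full_fraction * FULL_SIM_COST
    return max(0.0, remaining) * fast_speedup / FULL_SIM_COST


# Fully simulating all recorded data would need 2.5 times the Tier 2 CPU.
print(tier2_share(1.0, 0.0, fast_speedup=10.0))
```

Under this model, any full-simulation fraction above 20% leaves no room for fast simulation within the nominal 50% budget, which is the constraint the paragraph above describes.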

In addition, the experiment CMs do not typically 
provide physicists the necessary resources for the 
CPU-intensive activities which in the past decade have 
come to characterize analyses of HEP data at the 
Tevatron and the B factories. These activities in- 
clude sophisticated fits, statistical analysis of large 
"toy" Monte Carlo models, matrix element calcula- 
tions, and use of the latest discriminant techniques, 
such as boosted decision trees. Physicists must there- 
fore rely on leveraged resources and emerging tech- 
nologies to accomplish such tasks. 



2. Solutions 

In tandem with these developing needs in high-energy
physics, the role of computing in both business
and everyday life has evolved significantly. In
the early 1990s, the HTTP protocol was developed
to allow the hundreds of collaborating
physicists in HEP experiments at CERN to communicate
with their global group of colleagues. This
development was crucial to the later transformation 
of the Internet from an academic tool to a global 
medium. Today, computing is seen as a commodity, 
and a critical component of the world economy. Ma- 
jor companies like Google and Amazon manage huge 
data centers and sell CPU cycles, disk, and bandwidth 
by the minute and megabyte. Regular innovations in 
data processing, delivery, organization, and communi- 
cation continuously drive significant business and so- 
cial progress. 

Figure 2 compares the Google search volume for
"grid computing" with that for technologies such as solid-state
drives (SSDs), General Purpose Graphics Processing
Units (GPGPUs), Virtualization, and Cloud Computing.
We see that while HEP has been building large
and expensive grid sites along with the necessary grid 
software, Virtualization and Cloud Computing have 
developed a wider appeal. Fortunately, efforts to take 
advantage of these technologies have recently begun 
in HEP. 

Figure 2: Google search volume for "ssd", "gpgpu",
"virtualization", "cloud computing", and "grid computing"
normalized to "virtualization". From Google Trends.

Virtualization is likely to help address problems in
HEP computing that can be traced back to the fact
that HEP experiment software generally requires specific
operating systems (OS) and is typically difficult
to install. Virtualization provides a means of providing
a single file (i.e. a virtual hard drive image) which
can be pre-packaged with all necessary software, including
the OS, and can run on any modern "host"
machine regardless of the host hardware and OS. The
first benefit of virtualization is that average physicists
can simply download such a file and then immediately
be able to use their experiment's software on their personal
computer without a great deal of expertise. The
CernVM [20] effort has been particularly successful
in this area. Similarly, system administrators can use
virtualization to simplify deployment of the complicated
set of software packages required by grid sites.
This route is particularly attractive for Tier 3 sites
that do not necessarily have a full-time system administrator.
But perhaps the most promising application
of virtualization is in the area of opportunistic computing,
where idle desktop computers (for example, in
university offices, classrooms, and labs) can be used
to assist data simulation.

The appeal of Cloud Computing is the promise of
on-demand access to vast computing resources hosted
by companies that can provide attractive pricing due
to economies of scale. So, for example, a HEP experiment
can lease huge CPU resources for a short term for
data simulation. It is noteworthy that Cloud Computing
has an implicit reliance on Virtualization for delivering
the appropriate software environment to the
cloud resource. At this point, it is not clear that such
Cloud Computing is appropriate for HEP. Current
Cloud Computing offerings such as Amazon's EC2 are
targeted at web hosting rather than large data processing
tasks, and their pricing and performance are therefore
prohibitive for HEP applications. What's more,
the large Cloud Computing providers such as Amazon
and Microsoft favor their own proprietary software solutions
that lock their clients to their Cloud and have
avoided open efforts such as the "Open Cloud Manifesto"
by IBM. In addition, some in the industry view the
Cloud Computing vision as an unrealistic "myth".
Nonetheless, many within the HEP community are
exploring the potential of Cloud Computing. For example,
Nimbus now provides a means of turning Amazon
EC2 resources into a self-configuring cluster for HEP
computing.

Proceedings of the DPF-2009 Conference, Detroit, MI, July 27-31, 2009

A very promising technology that is now becoming
cost-effective is the SSD. Because these storage devices
have no mechanical or moving parts, they provide
impressive input/output (I/O) rates, in particular
huge gains in random read access in comparison
to traditional hard drives (HDs). Two high-I/O HEP
applications that can benefit from SSDs immediately
come to mind. The first is ntuple data analysis, a task
that is typically characterized by rapid iterations over
up to terabytes of data. The second is in computing
sites where hundreds of processes running on multiple
cores access data stored on a single storage device.
Since SSDs present the same interface as HDs, their
deployment in HEP environments is rather easy. The
primary limitation is then the additional cost of the
SSDs, which we may expect to lead to SSD/HD hybrid
solutions where HDs are used for long-term storage
and data are prestaged to SSDs for faster access.
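
A hybrid SSD/HD setup of this kind could be driven by a very small prestaging step. The following is a minimal sketch, assuming that the HD and the SSD appear as two ordinary directories and that files are prestaged in the order they will be read; the function name, the directory layout, and the capacity parameter are all hypothetical.

```python
import os
import shutil


def prestage(names, hd_dir, ssd_dir, ssd_capacity_bytes):
    """Copy files from HD to SSD, in the given order, while they fit.

    Returns a dict mapping each file name to the path an analysis job
    should read from: the SSD copy if it fit, otherwise the HD original.
    """
    os.makedirs(ssd_dir, exist_ok=True)
    used = 0
    read_paths = {}
    for name in names:
        src = os.path.join(hd_dir, name)
        size = os.path.getsize(src)
        if used + size <= ssd_capacity_bytes:
            dst = os.path.join(ssd_dir, name)
            shutil.copyfile(src, dst)
            used += size
            read_paths[name] = dst
        else:
            read_paths[name] = src  # fall back to long-term storage
    return read_paths
```

Because the SSD presents the same file interface as the HD, the analysis code only needs the returned path and does not have to know which device it is reading from.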

We may also note that while HEP has traditionally
interpreted Moore's law as predicting faster processors,
the industry has shifted to more cores per
processor. Thus we find that though many HEP applications
in principle lend themselves to parallelization,
very little existing HEP software was fundamentally
designed with parallel processing on multiple
cores in mind. Recently, the omnipresence of dual-
and quad-core processors has prompted some parallelization
efforts within the HEP community. These
efforts take advantage of the embarrassingly parallel
nature of HEP computations by simply running multiple
instances (or threads) of the software. The only
challenge then is sharing memory across these parallel
threads in applications that have a large memory
footprint, such as simulation and reconstruction.
Assuming that memory cost isn't a factor, the problem
then becomes the processor-to-memory bandwidth,
which can be saturated when a large number of cores
access a large amount of memory. This bandwidth limitation
constrains the performance gain when scaling to
a large number of cores per processor. As a result, most
HEP multi-core optimization efforts are concentrated
on process forking and thread safety.
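
The process-forking approach can be illustrated with a toy example. The sketch below is not taken from any HEP framework: it assumes a Unix system (it uses `os.fork`), uses a small list as a stand-in for large read-only data such as detector geometry, and all function names are hypothetical. The data are built once before forking, so each worker inherits the parent's memory image instead of loading its own copy. In CPython, reference counting can still cause some pages to be copied, so the memory saving is only approximate.

```python
import os
import struct


def process_events(geometry, events):
    """Stand-in for reconstruction: an integer sum weighted by geometry."""
    return sum(geometry[e % len(geometry)] * e for e in events)


def run_forked(geometry, events, n_workers):
    """Fork n_workers children, each handling an interleaved event slice.

    Each child sends its integer partial result back through a pipe.
    """
    children = []
    for w in range(n_workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:  # child process
            try:
                os.close(read_fd)
                partial = process_events(geometry, events[w::n_workers])
                os.write(write_fd, struct.pack("q", partial))
            finally:
                os._exit(0)  # never return into the parent's code path
        os.close(write_fd)
        children.append((pid, read_fd))

    total = 0
    for pid, read_fd in children:
        total += struct.unpack("q", os.read(read_fd, 8))[0]
        os.close(read_fd)
        os.waitpid(pid, 0)
    return total
```

Since the events are independent, the forked result equals the serial one; the workers share nothing except the data inherited at fork time.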

While Central Processing Units (CPUs) have been
evolving towards more cores, Graphics Processing
Units (GPUs) that were originally targeted at personal
computer gaming enthusiasts have evolved to
be capable of computing traditionally performed on
the CPU. These modern processors, which are now also
present in many desktop and laptop computers, are known
as General Purpose Graphics Processing Units
(GPGPUs). Despite the fact that for specific computing
tasks these GPGPUs can marshal thousands
of simplified processing units in parallel in order to
reduce computing times by orders of magnitude, they
have yet to capture the attention of HEP as a whole.



3. Simple ROOT Analysis 

A very simple study of the read/write rates of
ROOT [l^ analyses illustrates the potential effect of SSDs
and GPGPUs on the data analysis iteration rate. For
this study, we consider two simple ROOT applications,
one which creates an ntuple (TTree) with random
data (simple types such as bools, ints, floats, and
vectors of simple types), and another that reads and
histograms all quantities in these ntuples. We find that
simple read/write rates stabilize with about 20 variables
of each type per event (3 KB/event) and 600
events. Table I summarizes the results of the study.
With ROOT's data compression turned off, a single
instance of each application achieves approximately 25
MB/s read or write rate on a hard drive which provides
70 MB/s sequential read. With compression turned
on (providing a 30% file size reduction), this figure falls
to 4 and 16 MB/s for read and write, respectively, illustrating
that (de)compression is the main bottleneck
in input/output bound ROOT analyses. Ideally, the
GPU can be used to eliminate this bottleneck. In order
to observe the benefits of the fast random access of
SSDs, we ran eight instances of these applications with
the data stored on a single HD or SSD. We also observe
that only uncompressed data writing appears to
be limited by disk access. And though SSDs generally
provide faster rates, the improvement is significantly
smaller where the I/O is limited by (de)compression.
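
The CPU cost of (de)compression can be estimated independently of any disk. The sketch below is not the ROOT study described above: it uses Python's `zlib` module as a stand-in for ROOT's compression, and a synthetic payload (half random bytes, half zeros, about 3 KB per event) that is purely hypothetical. It only shows how one might time compression and decompression in memory and compare the resulting rates with the disk rate.

```python
import os
import time
import zlib


def throughput_mb_s(n_bytes, seconds):
    """Convert a byte count and an elapsed time to MB/s (1 MB = 1e6 bytes)."""
    return n_bytes / 1e6 / seconds if seconds > 0 else float("inf")


def make_payload(n_events, event_bytes=3000):
    """Synthetic event data: half random bytes, half zeros per event."""
    half = event_bytes // 2
    return b"".join(os.urandom(half) + bytes(event_bytes - half)
                    for _ in range(n_events))


def time_compression(payload, level):
    """Time in-memory zlib compression and decompression of payload.

    Returns (compress_MB_s, decompress_MB_s, size_ratio); both rates are
    quoted relative to the uncompressed size.
    """
    t0 = time.perf_counter()
    packed = zlib.compress(payload, level)
    t1 = time.perf_counter()
    unpacked = zlib.decompress(packed)
    t2 = time.perf_counter()
    assert unpacked == payload  # the round trip must be lossless
    return (throughput_mb_s(len(payload), t1 - t0),
            throughput_mb_s(len(payload), t2 - t1),
            len(packed) / len(payload))


if __name__ == "__main__":
    data = make_payload(2000)
    for level in (1, 6, 9):
        print(level, time_compression(data, level))
```

If the measured compression rate is below the sequential rate of the storage device, a faster device cannot help, which is consistent with the smaller SSD gains seen for compressed data.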

4. GPGPUs 

From the Cell processor in the PlayStation 3 (PS3)
to the newer-generation Graphics Processing Units
(GPUs) used in desktop and laptop computers, we
find the building blocks for High-Performance Computing
(HPC) systems already present in devices we
use daily. Originally driven by the gaming industry,
GPU architectures have recently been developed to
also support general-purpose computing. These GPUs
offer impressive power consumption/performance and
price/performance ratios. The omnipresence of these
general purpose GPUs has pushed industry leaders
such as Microsoft and Apple into a race to develop
strategies that take advantage of their capabilities.
The raw computational horsepower of GPUs is
staggering. A single modern GPU provides nearly
one TFLOPS (10^12 floating-point operations per second)
[l| , roughly 20 times more than a typical multi-core
CPU. What's more, the trend over the past
decade exhibits an exponential growth in the ratio of
the computational power of GPU to CPU.

Table I: Read/write rates for example ROOT analysis as a function of the number of simultaneous instances of the analysis and
ROOT compression level.

While use of computer graphics hardware for
general-purpose computation has been an area of active
research for many years (e.g. 0, [1] [11]), the wide
deployment of GPUs in the last several years has resulted
in an increase in experimental research with
graphics hardware. Some notable work includes password
cracking [4]] , artificial neural networks Q , solving
partial differential equations (PDEs) @, line integral
convolution and Lagrangian-Eulerian advection [?[, [1,
and protein folding (Folding@Home) [13] . In HEP,
the developers of FairRoot demonstrated a two
orders of magnitude acceleration of track fitting in high
multiplicity environments using GPUs.

Currently, the primary players in the GPU arena 
are NVIDIA and AMD (through the purchase of ATI). 
Nearly all of NVIDIA's GPUs, from the low-end lap- 
top GPU to the professional TESLA line, can be 
programmed using their Compute Unified Device Ar- 
chitecture (CUDA), a parallel programming compiler 
and software environment designed for issuing and 
managing general purpose computations on the GPU 
through extensions to the standard C language. There
are also standard numerical libraries for FFT (Fast
Fourier Transform) and BLAS (Basic Linear Algebra
Subprograms).

ATI/AMD has also developed proprietary
GPGPU hardware and software known as the Data Parallel
Virtual Machine (DPVM). However, their focus
seems to have now shifted to the Open Computing Language
(OpenCL), a GPU software standard originally
developed by Apple and then released to an open
consortium which includes all the relevant companies
as signatories [12]. The primary appeal of OpenCL
is its architecture independence, which is achieved
through run-time compilation of computing software
kernels. The first publicly available implementation
of OpenCL was released last week as part of
Apple's Snow Leopard operating system.

Other, less popular, GPU architectures and software
include ClearSpeed [H, BrookGPU [13, AMD
Stream Computing, and Sh [l^. Finally, a very
promising upcoming product is Intel's Larrabee, an x86-based
many-core GPU which is rumored to be released
later this year.



4.1. Developing GPGPU Applications

The processing model for a GPU is very different from
that of a CPU. Whereas CPUs are optimized for low latency,
GPUs are optimized for high throughput. GPUs
are essentially stream processors, hardware that operates
in parallel by running a single computation
on many records in a stream at once. The limitation
is that each parallel computation may only process
independent memory elements and cannot share
memory with the others. The result is that GPUs are
generally suitable for computations that exhibit specific
characteristics. They must be compute intensive,
with a large number of arithmetic operations per I/O
or global memory reference. They must be amenable
to data parallelism, where the same function is applied
to all records of an input stream and a number
of records can be processed simultaneously without
waiting for results from previous records. And they
must exhibit data locality, a specific type of temporal
locality common in signal and media processing applications,
where data is produced once, read once or
twice later in the application, and never read again.
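
The "compute intensive" requirement can be made quantitative with a simple roofline-style estimate, a standard way of relating arithmetic intensity to attainable performance. The sketch below is an illustration only: the peak of about 1 TFLOPS is the figure quoted earlier in the text, while the memory bandwidth of 100 GB/s is an assumed, hypothetical value, as are the function names.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Arithmetic operations per byte of global memory traffic."""
    return flops / bytes_moved


def attainable_gflops(peak_gflops, bandwidth_gb_s, intensity):
    """Roofline estimate: capped by the compute peak or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * intensity)


def is_compute_bound(peak_gflops, bandwidth_gb_s, intensity):
    """True if the kernel does enough work per byte to reach the peak."""
    return intensity >= peak_gflops / bandwidth_gb_s


PEAK_GFLOPS = 1000.0      # about 1 TFLOPS, as quoted in the text
BANDWIDTH_GB_S = 100.0    # assumed value, not from the text

# A kernel doing 2 operations per byte moved is limited by memory traffic.
print(attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, 2.0))  # 200.0
```

With these assumed numbers a kernel needs at least 10 operations per byte of memory traffic before the GPU's arithmetic units, rather than its memory system, set the speed.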

HEP applications that are good candidates for
GPGPU acceleration include Monte Carlo integration
for matrix element methods, discriminant training
or calculation in multivariate analysis, maximum-likelihood
fitting, compression/decompression during
data input/output, event generation, full or fast detector
simulation, event reconstruction (in particular for
the High Level Trigger), and detector alignment. The
difficulty with employing GPGPUs is that existing applications
cannot be simply rebuilt to run on GPGPUs,
but must rather treat the GPU as a co-processor.
The software must explicitly adopt a data-parallel
processing model where the data is broken into chunks
and independently processed by algorithms which are
highly constrained in both their memory access and
complexity. A practical approach to GPGPU acceleration
of existing applications is to rewrite computational
bottlenecks to prepare the data on the CPU, transfer
it to the GPU memory, execute the computation on
the GPU, and then transfer the results back.
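
The data-parallel structure described above can be shown without any GPU. The sketch below is a CPU-only illustration using Monte Carlo integration, one of the candidate applications listed: the work is split into chunks that share no state, each chunk is processed by the same small "kernel", and the partial results are gathered at the end. On a real GPU the `map` call would be replaced by a transfer to device memory and a kernel launch; all names and the seeding scheme are hypothetical.

```python
import random


def integrate_chunk(f, n_samples, seed):
    """Kernel: partial sum of f over [0, 1) from a separately seeded stream."""
    rng = random.Random(seed)
    return sum(f(rng.random()) for _ in range(n_samples))


def monte_carlo_integral(f, n_chunks, samples_per_chunk, base_seed=12345):
    """Estimate the integral of f over [0, 1) from independent chunks."""
    # 1. Prepare: one (seed, size) work item per chunk, no shared state.
    work = [(base_seed + i, samples_per_chunk) for i in range(n_chunks)]
    # 2-3. Execute: map() stands in for the transfer and kernel launch.
    partial_sums = list(map(lambda w: integrate_chunk(f, w[1], w[0]), work))
    # 4. Gather: combine the partial results on the host.
    return sum(partial_sums) / (n_chunks * samples_per_chunk)


if __name__ == "__main__":
    # The integral of x^2 over [0, 1) is 1/3.
    print(monte_carlo_integral(lambda x: x * x, 8, 5000))
```

Because no chunk depends on another, the chunks could be evaluated in any order or all at once, which is the property a GPU implementation relies on.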

Even modest acceleration of detector simulation
and reconstruction using GPGPUs will have a significant
impact on HEP computing. For reconstruction,
where thousands of tracks and hundreds of thousands
of calorimeter cells must be processed, any gain directly
translates to higher trigger output. For simulation,
where thousands of particles must be propagated
through the detector and magnetic fields, gains translate
into lower computing resource requirements. The
practical time-scale for deployment of such GPGPU-accelerated
strategies is the LHC upgrade. While
strategies for GPGPU acceleration of tracking and
calorimetry are rather straightforward, the complexity
of Geant4 [l^ and the fact that it wasn't written
with parallelization in mind make simulation a
much more difficult problem. Taking full advantage
of GPGPUs in simulation will likely only be possible
in the next generation of software (perhaps Geant5).
One promising interim strategy is to employ multiple
parallel Geant4 threads running on the CPU (see the
Geant4 parallelization efforts of [17]) that offload specific
calculations (e.g. magnetic field extrapolation) to
a service that can perform the calculation in batches using
the GPU.
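
The interim strategy of a batching service can be sketched in a few lines. The code below is a hypothetical illustration and is not part of Geant4 or of any GPU library: worker threads submit single requests, and one service thread collects whatever requests are pending and evaluates them together. The batched function here runs on the CPU and stands in for a GPU kernel; `field_batch`, `BatchService`, and the toy field formula are all invented for the example.

```python
import queue
import threading
from concurrent.futures import Future


def field_batch(points):
    """Hypothetical batched kernel: a toy field value for every point."""
    return [2.0 * x for x in points]


class BatchService:
    """Collects single requests from many threads and evaluates them in batches."""

    def __init__(self, batch_fn, max_batch=32):
        self._batch_fn = batch_fn
        self._max_batch = max_batch
        self._requests = queue.Queue()
        self.batch_sizes = []  # recorded so that batching can be inspected
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, point):
        """Queue one request; the returned Future will hold its result."""
        fut = Future()
        self._requests.put((point, fut))
        return fut

    def close(self):
        """Ask the service thread to stop and wait for it to finish."""
        self._requests.put(None)
        self._worker.join()

    def _run(self):
        while True:
            item = self._requests.get()  # block until the first request
            if item is None:
                return
            batch, stop = [item], False
            while len(batch) < self._max_batch:
                try:
                    nxt = self._requests.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    stop = True
                    break
                batch.append(nxt)
            results = self._batch_fn([p for p, _ in batch])
            for (_, fut), value in zip(batch, results):
                fut.set_result(value)
            self.batch_sizes.append(len(batch))
            if stop:
                return
```

A simulation thread would call `service.submit(x).result()` wherever it previously evaluated the field directly; the larger the number of threads waiting at once, the larger the batches handed to the device.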